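# Load the tidyverse (readr for read_csv, ggplot2 for plotting)
library(tidyverse)
# Read the Spotify song data; adjust the path to where the CSV is saved
spotify <- read_csv("~/Downloads/spotify.csv")
Nhi Luong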
November 21, 2025
In this project, I explore a dataset of song characteristics from Spotify.
I chose valence as the numeric response variable and instrumentalness as the binary categorical explanatory variable. Valence describes the musical positiveness of a track: the more positive a track is, the closer the value is to 1.0. Instrumentalness classifies whether a song is instrumental or not.
Research question: Is there a difference in the mean valence score of songs classified as instrumental and songs not classified as instrumental? In other words, is instrumentalness associated with a song’s valence score?
  instrumentalness     min      Q1  median      Q3    max       mean         sd    n  missing
1 instrumental      0.1000 0.28425  0.4225 0.57175  0.942  0.4495270  0.2273844   74        0
2 not instrumental  0.0628 0.35700  0.4965 0.69800  0.934  0.5194762  0.2282755  126        0
# Horizontal boxplots of valence by instrumentalness, with jittered points on top
spotify |>
  ggplot(aes(y = valence, x = instrumentalness, fill = instrumentalness)) +
  geom_boxplot(width = 0.25) +
  geom_jitter(width = 0.05, alpha = 0.5) +
  theme(legend.position = "none") +  # fill repeats the axis, so hide the legend
  labs(title = "Valence Score - Instrumental vs. Not Instrumental",
       x = "Instrumentalness",
       y = "Valence Score") +
  coord_flip()  # flip so the boxes run horizontally
Hypothesis Test
Null hypothesis: \(H_0: \mu_{i} - \mu_{ni} = 0\). There is no difference in the mean valence score of instrumental and non-instrumental songs. Instrumentalness is not associated with a song’s valence score.
Alternative hypothesis: \(H_A: \mu_{i} - \mu_{ni} \ne 0\). There is a difference in the mean valence score of instrumental and non-instrumental songs. Instrumentalness is associated with a song’s valence score.
Welch Two Sample t-test
data: valence by instrumentalness
t = -2.0974, df = 153.57, p-value = 0.0376
alternative hypothesis: true difference in means between group instrumental and group not instrumental is not equal to 0
95 percent confidence interval:
-0.13583447 -0.00406386
sample estimates:
mean in group instrumental mean in group not instrumental
0.4495270 0.5194762
From the t-test, the test statistic is -2.097 and the p-value is 0.0376. Because the p-value is below 0.05, we reject the null hypothesis at the 5% significance level.
The 95% confidence interval is (-0.136, -0.004). We are 95% confident that the mean valence score for instrumental songs is between 0.004 and 0.136 points lower than for non-instrumental songs. Because the interval does not contain 0, the difference is statistically significant at the 5% level.
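As a check on the reported values, the Welch statistic, degrees of freedom, p-value, and confidence interval can be recomputed from the group summaries alone. This is a minimal sketch, assuming the means, standard deviations, and sample sizes are exactly those printed in the summary table; the variable names are illustrative and not part of the original analysis.

```r
# Summary statistics copied from the summary table (assumed exact)
m_i  <- 0.4495270; s_i  <- 0.2273844; n_i  <- 74    # instrumental
m_ni <- 0.5194762; s_ni <- 0.2282755; n_ni <- 126   # not instrumental

# Variance of each group's sample mean
v_i  <- s_i^2  / n_i
v_ni <- s_ni^2 / n_ni

# Welch standard error, test statistic, and Welch-Satterthwaite df
se     <- sqrt(v_i + v_ni)
t_stat <- (m_i - m_ni) / se
df     <- (v_i + v_ni)^2 / (v_i^2 / (n_i - 1) + v_ni^2 / (n_ni - 1))

# Two-sided p-value and 95% confidence interval for mu_i - mu_ni
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (m_i - m_ni) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value), 4)
round(ci, 4)
```

Working from the summaries rather than the raw data makes it easy to see that the fractional degrees of freedom come from the unequal-variance (Welch) formula rather than from n - 2.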
There are two conditions for the test: independence and normality.
Independence: Although the method of collecting the sample is not mentioned, we assume that the observations are independent both within and between groups. Knowing one song’s valence score should not tell us anything about another song’s valence score.
Normality: Both groups have sample sizes greater than 30 (74 and 126), and there appear to be no extreme outliers, so the normality condition is met.
I chose top10 as the binary categorical response variable and mode as the binary categorical explanatory variable. top10 tells whether a song is ranked in the top 10 or not. mode indicates whether a track is in a major or minor key.
Research question: Is there a real difference in the proportion of top-10 songs between major-key songs and minor-key songs?
Counts (rows: mode, columns: top10):
         no   yes   Sum
major   105    22   127
minor    55    18    73
Sum     160    40   200
Row proportions:
           no     yes
major   0.827   0.173
minor   0.753   0.247
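The two tables above can be rebuilt from the four cell counts. The following is a minimal sketch that enters the counts by hand instead of tabulating the `spotify` data; the object names are illustrative.

```r
# Cell counts entered by hand from the table of counts (rows: mode, columns: top10)
counts <- matrix(c(105, 22,
                    55, 18),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(mode  = c("major", "minor"),
                                 top10 = c("no", "yes")))

# Add row and column totals
with_totals <- addmargins(counts)

# Proportions within each mode (each row sums to 1)
row_props <- round(prop.table(counts, margin = 1), 3)

with_totals
row_props
```

Using `margin = 1` conditions on the explanatory variable (mode), which is the comparison the research question asks for.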
Hypothesis Test
Null hypothesis: \(H_0: p_{major} - p_{minor} = 0\). There is no difference in the proportion of top-10 songs between major-key songs and minor-key songs.
Alternative hypothesis: \(H_A: p_{major} - p_{minor} \ne 0\). There is a difference in the proportion of top-10 songs between major-key songs and minor-key songs.
2-sample test for equality of proportions with continuity correction
data: c(22, 18) out of c(127, 73)
X-squared = 1.1339, df = 1, p-value = 0.2869
alternative hypothesis: two.sided
95 percent confidence interval:
-0.20291093 0.05621694
sample estimates:
prop 1 prop 2
0.1732283 0.2465753
From the test, the p-value is 0.2869, which is greater than 0.05, so we fail to reject the null hypothesis. The 95% confidence interval for \(p_{major} - p_{minor}\) is (-0.203, 0.056), which contains 0, so the data do not show a significant difference in the proportion of top-10 songs between the two keys.
There are two conditions for the test: independence and normality (success/failure).
Independence: Although the method of collecting the sample is not mentioned, we assume that the observations are independent both within and between groups.
Normality: We check the number of successes (ranked in the top 10) and failures (not in the top 10) in each explanatory group. In the major group, there are 22 successes and 105 failures, both greater than 10. In the minor group, there are 18 successes and 55 failures, also both greater than 10. Because each group has at least 10 successes and 10 failures, the condition is met.
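The success/failure check and the two-proportion test can both be run directly from the counts. This is a sketch using base R's `prop.test()` with its default continuity correction, which matches the header of the output shown earlier; the vector names are illustrative.

```r
# Top-10 songs (successes) and group sizes, taken from the table of counts
successes <- c(major = 22, minor = 18)
totals    <- c(major = 127, minor = 73)
failures  <- totals - successes

# Success/failure condition: at least 10 of each outcome in each group
condition_met <- all(successes >= 10, failures >= 10)

# Two-sided test of p_major - p_minor = 0 (continuity correction is the default)
res <- prop.test(x = successes, n = totals)

condition_met
res$statistic
res$p.value
res$conf.int
```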
I chose trend as the categorical response variable and mode as the categorical explanatory variable. trend describes how a song moved in the rankings since the previous week (down, up, same, or new entry). mode indicates whether a track is in a major or minor key.
Research question: Is there an association between mode and trend?
Counts (rows: mode, columns: trend):
        MOVE_DOWN   MOVE_UP   NEW_ENTRY   SAME_POSITION   Sum
major          56        43           4              24   127
minor          32        22           1              18    73
Sum            88        65           5              42   200
Row proportions:
        MOVE_DOWN   MOVE_UP   NEW_ENTRY   SAME_POSITION
major       0.441     0.339       0.031           0.189
minor       0.438     0.301       0.014           0.247
Note: The table of counts has a NEW_ENTRY column. These 5 songs have no ranking from the previous week to compare against, so I removed this column. The new table of counts, table of proportions, and bar graph follow.
Counts without NEW_ENTRY (rows: mode, columns: trend):
        MOVE_DOWN   MOVE_UP   SAME_POSITION   Sum
major          56        43              24   123
minor          32        22              18    72
Sum            88        65              42   195
Row proportions:
        MOVE_DOWN   MOVE_UP   SAME_POSITION
major       0.455     0.350           0.195
minor       0.444     0.306           0.250
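Dropping the NEW_ENTRY column and recomputing the tables can be sketched as follows. The counts are entered by hand from the full table; in the original analysis this step was presumably done on the `spotify` data itself, so the object names here are illustrative.

```r
# Full table of counts (rows: mode, columns: trend), entered by hand
trend_counts <- matrix(c(56, 43, 4, 24,
                         32, 22, 1, 18),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(mode  = c("major", "minor"),
                                       trend = c("MOVE_DOWN", "MOVE_UP",
                                                 "NEW_ENTRY", "SAME_POSITION")))

# Keep every column except NEW_ENTRY
reduced <- trend_counts[, colnames(trend_counts) != "NEW_ENTRY"]

# Recompute totals and within-mode proportions on the reduced table
reduced_totals <- addmargins(reduced)
reduced_props  <- round(prop.table(reduced, margin = 1), 3)

reduced_totals
reduced_props
```

The proportions change slightly because the row totals fall from 127 and 73 to 123 and 72 once the new entries are excluded.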
Hypothesis Test
Null hypothesis: \(H_0:\) There is no association between mode and trend.
Alternative hypothesis: \(H_A:\) There is an association between mode and trend.
Pearson's Chi-squared test
data: .
X-squared = 0.91107, df = 2, p-value = 0.6341
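From the test, the p-value is 0.6341, which is greater than 0.05, so we fail to reject the null hypothesis. The data do not show a significant association between mode and trend.
There are two conditions for the test: independence and expected counts greater than 5.
Independence: Although the method of collecting the sample is not mentioned, we assume that the observations are independent both within and between groups.
Expected counts: The condition is met because the expected count in every cell, shown below, is greater than 5.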
        MOVE_DOWN   MOVE_UP   SAME_POSITION
major    55.50769        41        26.49231
minor    32.49231        24        15.50769
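The expected counts and the chi-squared statistic can be reproduced from the reduced table of counts. This is a minimal sketch using base R's `chisq.test()` on a hand-entered matrix; the object names are illustrative.

```r
# Reduced table of counts (NEW_ENTRY removed), entered by hand
reduced <- matrix(c(56, 43, 24,
                    32, 22, 18),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(mode  = c("major", "minor"),
                                  trend = c("MOVE_DOWN", "MOVE_UP", "SAME_POSITION")))

# Pearson chi-squared test of independence between mode and trend
res <- chisq.test(reduced)

# Expected counts under independence: (row total * column total) / grand total
res$expected

# Condition check: every expected count is greater than 5
expected_ok <- all(res$expected > 5)

expected_ok
res$statistic
res$parameter   # degrees of freedom: (2 - 1) * (3 - 1) = 2
res$p.value
```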
Summary: